Leveraging Machine Learning Models for Peptide-Protein Interaction Prediction

Peptides play a pivotal role in a wide range of biological activities by participating in up to 40% of protein-protein interactions in cellular processes. They also demonstrate remarkable specificity and efficacy, making them promising candidates for drug development. However, predicting peptide-protein complexes with traditional computational approaches, such as Docking and Molecular Dynamics simulations, remains a challenge because of the high computational cost, the flexible nature of peptides, and the limited structural information on peptide-protein complexes. In recent years, the surge of available biological data has driven the development of a growing number of machine learning models for predicting peptide-protein interactions. These models offer efficient solutions to the challenges associated with traditional computational approaches. Furthermore, they offer enhanced accuracy, robustness, and interpretability in their predictive outcomes. This review presents a comprehensive overview of machine learning and deep learning models that have emerged in recent years for the prediction of peptide-protein interactions.


Introduction
Peptides consist of short chains of amino acids connected by peptide bonds, typically comprising 2 to 50 amino acids. One of the most critical functions of peptides is their mediation of 15-40% of protein-protein interactions (PPIs). 1 PPIs play essential roles in various biological processes within living organisms, including DNA replication, DNA transcription, catalysis of metabolic reactions, and regulation of cellular signaling. 2 Peptides have become promising drug candidates because of their ability to modulate PPIs. Over the past century, the Food and Drug Administration (FDA) has approved more than 80 peptide drugs, 3 with insulin being the pioneering therapeutic peptide used extensively in diabetes treatment. Compared with small molecules, peptide drugs demonstrate high specificity and efficacy. 4 Additionally, compared with other classes of drug candidates, peptides have more flexible backbones, which enables better membrane permeability. 4 However, the rational design of peptide drugs is challenging and costly because of their lack of stability and the large pool of potential target candidates. Therefore, computational methodologies that have proven effective in small-molecule drug design have been adapted for modeling peptide-protein interactions (PepPIs). These computational techniques include Docking, Molecular Dynamics (MD) simulations, and machine learning (ML) and deep learning (DL) models. Docking approaches enable exploration of peptide binding positions and poses in atomistic detail, facilitating the prediction of binding affinities. 5 However, peptides are inherently flexible and can interact with proteins in various conformations, which often change during the binding process. 6 MD simulation is another approach to modeling peptide-protein interactions. The peptide-protein binding and unbinding process can be studied thermodynamically and kinetically through MD simulations. 7 However, sampling the complex energy landscapes associated with peptide-protein interactions typically requires intensive computational resources and time. The accuracy of both Docking and MD simulations relies on knowledge of protein structures; therefore, the limited availability of peptide-protein complex structures has restricted the utility of these two approaches.
In recent years, ML and DL models have been widely used in the field of computer-aided drug design. These models offer an alternative way to address the inherent challenges associated with Docking and MD simulations in modeling PepPIs. Owing to the large amount of available biological data, many ML/DL models are routinely employed to learn sequence-function relationships, achieving predictive performance comparable to that of structure-based models. This is because sequence data contains evolutionary, structural and functional information across protein space. Furthermore, compared with Docking and MD simulations, ML/DL models exhibit greater efficiency and generalizability. Trained ML/DL models can predict PepPIs in a single pass, whereas large-scale Docking and MD simulations are difficult to carry out because they are resource-intensive and time-consuming. Moreover, with the development of interpretable models, DL models are no longer regarded as black boxes; they can provide valuable insights into residue-level contributions to peptide-protein binding predictions.
[10][11][12][13][14] They have traditionally categorized computational methods for predicting PPIs into two main classes: sequence-based and structure-based approaches. Sequence-based methods extract information only from sequence data, whereas structure-based methods rely on the information derived from peptide-protein complex structures. Recently, ML/DL models have increasingly integrated both sequence and structure information to enhance their predictive performance.
In this review, we systematically summarize the progress made in predicting PepPIs. From the ML perspective, we include Support Vector Machine (SVM) and Random Forest (RF). ML models typically require manual feature extraction from sequence and structure datasets. DL models, including Convolutional Neural Network (CNN), Graph Convolutional Network (GCN) and Transformer, instead extract multi-layer feature representations from data automatically. To the best of our knowledge, this is the first review to summarize the ML/DL work specifically for predicting PepPIs. Figure 1 shows a timeline of the evolution of ML/DL methods for PepPI prediction. Table 1 summarizes the details of the ML/DL models discussed in this review.

Table 1 (recovered rows; cells lost in extraction are marked [...]):
[...] | [...] | [...] | Utilized a multi-head reciprocal attention layer to update the embeddings of both peptide and protein; transfer learning was applied to solve the limited protein-peptide complex structures issue
PepBCL 4 | [...]
[...] | MSA based transformer | Uniclust30 38 and RCSB PDB 20 | Adding the peptide sequence via a poly-glycine linker to the C-terminus of the receptor monomer sequence could mimic peptide docking as monomer folding
OmegaFold 37,39 | protein language model | Uniref50, 40 RCSB PDB, 20 CASP, 41 and CAMEO 42 | [...]
AlphaFold Multimer 37,43 | MSA based transformer | RCSB PDB 20 and Benchmark 2 44 | Improved the accuracy of predicted multimeric interfaces between two or more proteins
Fine-tuned AlphaFold 45 | MSA based transformer | RCSB PDB 20 | Leveraging and fine-tuning AF2 with existing peptide-protein binding data could improve its PepPIs predictions

Machine Learning Models for Peptide-Protein Interactions Prediction
Support Vector Machine (SVM). SVM is a powerful ML algorithm commonly employed for classification tasks. The objective of SVM is to determine the optimal hyperplane that effectively separates data points belonging to different classes in the feature space. The optimal hyperplane is chosen to maximize the margin between the closest points of distinct classes, thereby minimizing misclassification rates.
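The margin-maximization idea can be illustrated with a small sketch. The code below is not taken from any of the reviewed tools; it is a minimal linear soft-margin SVM trained by full-batch subgradient descent on the hinge loss, and the function names, the toy data, and the hyperparameters (`lam`, `lr`, `epochs`) are all illustrative assumptions.

```python
# Minimal linear soft-margin SVM (hinge loss + L2 penalty), trained by
# full-batch subgradient descent. Names and hyperparameters are assumptions.

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """X: list of feature vectors; y: labels in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 (margin) term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point lies inside the margin: hinge loss is active
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return the class (+1 or -1) given by the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data: +1 could stand for "binding residue".
X = [[2.0, 2.0], [3.0, 3.0], [2.5, 3.5], [-2.0, -2.0], [-3.0, -1.0], [-1.5, -2.5]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
```

In practice a kernel SVM from an established library would normally be used; the sketch only shows how the hinge loss penalizes points that fall inside the margin.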
SPRINT-Seq (Sequence-based prediction of Protein-peptide Residue-level INTeraction sites) is the first ML-based method to predict peptide-protein binding sites using only sequence features. 15 Various types of information were extracted from the protein sequence to create a feature dataset, including one-hot encoded protein sequences, evolutionary information, 46 predicted accessible surface area, 47 secondary structure, 47 and physicochemical properties. 48 These features were fed into an SVM classification model to predict the label of each residue (Figure 2). SPRINT-Seq yielded a Matthews correlation coefficient (MCC) of 0.326, a sensitivity of 0.64 and a specificity of 0.68 on an independent test set. The importance of each feature was also evaluated; the most crucial feature for distinguishing binding from non-binding residues is the sequence evolution profile. The performance of this sequence-based technique is comparable to or better than that of structure-based models (Peptimap, 49 Pepsite, 50 PinUp, 51 VisGrid 52 ) for peptide-binding site prediction.
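Because MCC, sensitivity and specificity recur throughout this review, a short sketch of how they are computed from per-residue labels may be useful. The definitions are the standard ones; the function names and the toy labels are illustrative assumptions and do not reproduce any reported result.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = binding, 0 = non-binding)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    """Fraction of true binding residues that were recovered (recall)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Fraction of non-binding residues correctly rejected."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy example: 3 binding residues out of 10.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
```

MCC uses all four cells of the confusion matrix, which is why it is often preferred over accuracy when binding residues are a small minority.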
To improve the accuracy of sequence-based prediction, Zhao et al. introduced intrinsic disorder as a feature within the sequence representation. 17 Peptides that participate in peptide-protein interactions exhibit the consistent attributes of short linear motifs, which are primarily found in intrinsically disordered regions (IDRs). These attributes include short length, flexible structure and weak binding affinity. 53 In addition to the novel sequence representation, they designed a consensus-based method called PepBind. 17 This method combines an SVM classification model with the template-based methods S-SITE and TM-SITE. 54 The aggregation of these three

Ensemble Learning. In the pursuit of a more robust predictive model for protein-peptide binding sites, Shafiee et al. adopted an ensemble-based ML classifier named SPPPred. 21 Ensemble learning stands out as an effective strategy for handling imbalanced datasets, as it allows multiple models to collectively contribute to predictions, resulting in enhanced robustness, reduced variance, and improved generalization. 61 In the SPPPred algorithm, the ensemble learning technique of bagging 62 was employed to predict peptide binding residues. The initial step in bagging involves generating various subsets of data through random sampling with replacement, a process known as bootstrapping. For each bootstrap dataset, distinct classification models are trained, including Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Random Forest (RF). Subsequently, for each residue, the class receiving the majority of votes across these models is taken as the final predicted label. This ensemble method consistently demonstrates strong and comparable performance on independent test sets, with an F1 score of 0.31, an accuracy of 0.95, and an MCC of 0.23.
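The two steps of bagging described above, bootstrapping followed by majority voting, can be sketched as follows. This is not the SPPPred implementation: to stay short, the SVM, KNN and RF base learners are replaced by a toy nearest-centroid classifier on a single feature, and all names, data and parameters are illustrative assumptions.

```python
import random
from collections import Counter


def bootstrap(data, rng):
    """Draw len(data) samples with replacement (one bootstrap dataset)."""
    return [rng.choice(data) for _ in data]


def fit_centroid(sample):
    """Toy base learner (stand-in for SVM/KNN/RF): per-class mean of a 1-D feature."""
    sums, counts = {}, {}
    for x, label in sample:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_centroid(model, x):
    """Assign x to the class whose centroid is closest."""
    return min(model, key=lambda label: abs(x - model[label]))


def bagging_predict(data, x, n_models=11, seed=0):
    """Train one base learner per bootstrap dataset and take the majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        model = fit_centroid(bootstrap(data, rng))
        votes.append(predict_centroid(model, x))
    return Counter(votes).most_common(1)[0][0]


# Toy per-residue data: (feature value, label), 1 = binding residue.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.25, 0),
        (0.8, 1), (0.9, 1), (0.85, 1), (0.7, 1)]
```

An odd number of voters is used so that a two-class vote cannot end in a tie.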
Other State-Of-The-Art (SOTA) Models. Some bespoke SOTA ML models have achieved great success in predicting PepPIs, for example, hierarchical statistical mechanical modeling (HSM). 22 A dataset of 8 peptide-binding domain (PBD) families was used to train and test the HSM model, including PDZ, SH2, SH3, WW, WH1, PTB, TK, and PTP, which cover 39% of human PBDs. The HSM model defines a pseudo-Hamiltonian, which is a machine-learned approximation of the Hamiltonian that maps the system state to its energy. 63 The predicted PepPI probability is derived from the sum of the pseudo-Hamiltonian terms corresponding to each PBD-peptide sequence pair. In total, 9 models were developed (Figure 3a Furthermore, the HSM model provides detailed explanations of the peptide-protein binding mechanism, demonstrating strong interpretability. Using peptide binding with the HCK-SH3 domain (PDBID: 2OI3) 66 as an example, the HSM model gave a detailed examination and explanation of the peptide-SH3 domain binding mechanism. The "W114 tryptophan switch" binding motif 67 was correctly recognized by the HSM model (Figure 3b). Additionally, a conserved triplet of aromatic residues W114-Y132-Y87 was previously identified as contributing to peptide binding with the HCK-SH3 domain (Figure 3b). 68,69 However, the HSM model also found that Y89 and Y127 had predicted energetic profiles similar to that of W114, suggesting a new possible W-Y-Y aromatic triplet (Figure 3c). By mapping the predicted interaction energies to the complex structure, the HSM model successfully recognized the repulsive binding regions (shown in magenta) and attractive binding regions (shown in blue) (Figure 3d). The predicted attractive binding interface correctly aligns with the previously studied RT-loop and proline recognition pocket, 68,69 demonstrating the strong predictive and interpretative ability of the HSM model.

Deep Learning Models for Peptide-Protein Interactions Prediction
Convolutional Neural Network (CNN). CNN is a class of neural networks that has demonstrated great success in processing image data. 70 The design of CNNs was inspired by the biological visual system in humans. When humans see an image, each neuron in the brain processes information within its own receptive field and connects with other neurons so that together they cover the entire image. Similarly, each neuron in a CNN only processes data in its receptive field. This approach allows CNNs to detect simpler patterns first and subsequently assemble them into more complex patterns. A typical CNN architecture consists of three types of layers: the convolutional layer, the pooling layer, and the fully connected layer. In the convolutional layer, a dot product is computed between two matrices: the first is a kernel with a set of learnable parameters, and the second is the portion of the input covered by the receptive field. The kernel slides across the entire image, generating a two-dimensional representation. The pooling layer replaces the output of the convolutional layer at each location with a summary statistic of the nearby outputs. This reduces the size of the feature maps and, in turn, the training time. Finally, the fully connected layer connects the information extracted by the previous layers to the output layer and classifies the input into a label. Biological data can be transformed into an image-like pattern; therefore, CNNs can be applied to binding site identification.
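The convolution and pooling operations described above can be sketched in a few lines. The code assumes a single channel, no padding, a stride of 1 for the convolution and max pooling as the summary statistic; the function names and toy values are illustrative.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image and take the dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(fmap, size=2):
    """Replace each non-overlapping size x size block by its maximum."""
    return [
        [
            max(fmap[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]


image = [
    [1, 2, 3, 0],
    [4, 5, 6, 0],
    [7, 8, 9, 0],
    [0, 0, 0, 0],
]
kernel = [[1, 0], [0, 1]]  # in a trained CNN these weights would be learned

feature_map = conv2d_valid(image, kernel)  # 3 x 3 feature map
pooled = max_pool2d(feature_map)           # 1 x 1 after 2 x 2 max pooling
```

A 4×4 input and a 2×2 kernel give a 3×3 feature map, and pooling shrinks it further, which is the size reduction the paragraph refers to.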
Wardah et al. applied CNNs to the identification of peptide-binding sites by introducing a CNN-based method named Visual. 24 In the Visual algorithm, features were extracted from the protein sequence, such as HSE, 56 secondary structure, 71 ASA, 71 local backbone angles, 71 PSSM 46 and physicochemical properties. 72 These features were stacked horizontally, resulting in a feature vector of length 38 for each residue. Visual employs a sliding window approach to capture the local context of each residue. For a given residue, the feature vectors of the three upstream and three downstream residues were combined with its own into a matrix, resulting in a 2-dimensional array of size 7×38. An illustrative example of the input data in an image-like format is depicted in Figure 4, showing the center residue Serine (S) within a window of size 7. The resulting 7×38 image is the input to the CNN classifier. The Visual model comprises two sets of convolutional layers, followed by a pooling layer and a fully connected layer (Figure 4).
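The sliding-window construction of the 7×38 input can be sketched as below. How Visual treats residues near the termini is not stated in the text, so the zero padding used here is an assumption, as are the function name and the placeholder feature values.

```python
def residue_windows(features, flank=3):
    """Build one (2 * flank + 1) x D matrix per residue.

    features: list of L per-residue feature vectors of length D.
    Zero padding at the sequence ends is an assumption made for this sketch.
    """
    dim = len(features[0])
    pad = [[0.0] * dim for _ in range(flank)]
    padded = pad + features + pad
    return [padded[i:i + 2 * flank + 1] for i in range(len(features))]


# Placeholder features: 10 residues, 38 values each (not real descriptors).
features = [[float(i)] * 38 for i in range(10)]
windows = residue_windows(features)
```

Each entry of `windows` is a 7×38 matrix whose middle row is the feature vector of the residue being classified.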
Visual was applied to identify the peptide binding sites of proteins and achieved a sensitivity of 0.67 and a ROC AUC of 0.73. improved by using representations that could handle the protein rotation invariance.
Graph Convolutional Network (GCN). 4][75][76][77] Graph embedding 78 includes nodes (vertices) representing different entities and edges (links) representing the relationships between them. For proteins, graphs typically assign amino acids and related information to nodes, with the distances and connections between amino acids represented as edges. This approach allows information to be taken directly from protein 3D structures without hand-crafted features. 13,79 GCNs 80,81 are a type of neural network that can be used to learn graph embeddings. Similar to CNNs, GCNs take graph embeddings as input and progressively transform them through a series of localized convolutional and pooling layers, where each layer updates all vertex features. The updated embeddings are passed through a classification layer to obtain the final classification results. 78,80 GCNs have been successfully applied to protein binding site prediction, with models such as PipGCN 73 and EGCN 74 achieving great success. More recently, a number of GCN-based models have also been applied to PepPI prediction.
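A single graph-convolution step on a residue graph can be sketched as follows. The sketch assumes the widely used symmetric-normalized propagation rule, ReLU(D^-1/2 A D^-1/2 X W), and an 8 Å distance cutoff for drawing edges; neither is stated in the text for any specific model, and all names and values are illustrative.

```python
import math


def contact_adjacency(coords, cutoff=8.0):
    """Adjacency with self-loops: residues closer than `cutoff` share an edge.

    The 8 Angstrom cutoff is an assumption chosen for illustration.
    """
    n = len(coords)
    return [
        [1.0 if i == j or math.dist(coords[i], coords[j]) <= cutoff else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def gcn_layer(adj, feats, weight):
    """One propagation step: ReLU(D^-1/2 A D^-1/2 X W)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    norm = [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate neighbor features, then apply the weight matrix and ReLU.
    agg = [[sum(norm[i][k] * feats[k][d] for k in range(n))
            for d in range(len(feats[0]))] for i in range(n)]
    return [[max(0.0, sum(agg[i][d] * weight[d][o] for d in range(len(weight))))
             for o in range(len(weight[0]))] for i in range(n)]


# Three residues: the first two are in contact, the third is isolated.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity here; learned in a real model

adj = contact_adjacency(coords)
updated = gcn_layer(adj, feats, weight)
```

After one step, the two residues in contact share an averaged representation, while the isolated residue keeps its own features; stacking layers lets information travel further along the graph.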
InterPepRank 26 is a representative GCN that has been developed to predict PepPIs. In this model, billions of decoys (computationally generated structures of the complex) were produced with the PIPER 82 docking tool to form the training and testing sets. The peptide-protein complexes were then represented as graphs, with node features comprising one-hot encoded residues, PSSM, 83 and self-entropy, 83 and one-hot encoded edges denoting the residue interactions. Both node and edge features were then passed through edge convolution layers, with the output of each layer concatenated and fed into a global pooling layer and two dense layers to predict the LRMSD (ligand root-mean-square deviation) of the decoys. InterPepRank achieved a median ROC AUC of 0.86, outperforming other benchmarked methods such as PIPER, 82 pyDock3, 84 and Zrank. 85 For example, in the case of a fragment from the center of troponin I (peptide) binding to the C-terminal domain of Akazara scallop troponin C (receptor), 86 the peptide was shown to be disordered when unbound and to adopt an ordered α-helical structure upon binding, 87 following the induced-fit binding mechanism. Predicting the peptide binding conformation and binding sites for systems with induced-fit mechanisms is extremely challenging. The top 100 decoys predicted by InterPepRank and by Zrank showed that both methods can find the true binding site of the peptide. However, InterPepRank achieved an accuracy of 96% in predicting the peptide as an α-helical structure, while Zrank achieved an accuracy of less than 50%, with half of the peptide decoys' secondary structures predicted as either random coils or β-sheets. Therefore, InterPepRank is a powerful tool for predicting both binding sites and conformations, even in cases where the peptide is disordered when unbound. This is a significant advantage over the other benchmarked energy-based docking methods, which may struggle with disordered structures that are more energetically favorable in unbound states or easier to fit into false positive binding sites.
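The LRMSD that serves as the regression target can be sketched as a plain coordinate RMSD over the peptide atoms. The sketch assumes that decoy and native complexes have already been superimposed on the receptor and that atoms are listed in the same order; the function name and the coordinates are illustrative.

```python
import math


def ligand_rmsd(decoy_peptide, native_peptide):
    """RMSD between matched peptide atoms of a decoy and the native complex.

    Assumes both complexes were already superimposed on the receptor,
    so no additional fitting is performed here.
    """
    if len(decoy_peptide) != len(native_peptide):
        raise ValueError("decoy and native must contain the same atoms")
    squared = sum(math.dist(a, b) ** 2
                  for a, b in zip(decoy_peptide, native_peptide))
    return math.sqrt(squared / len(native_peptide))


native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
decoy = [(0.0, 3.0, 4.0), (1.0, 3.0, 4.0)]  # every atom displaced by 5 Angstrom
```

A decoy whose peptide sits close to the native pose has a small LRMSD, so ranking decoys by predicted LRMSD places near-native poses first.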
Struct2Graph 29 is a novel multi-layer mutual graph attention convolutional network for structure-based predictions of PPIs (Figure 5). Coarse-grained graph embeddings were generated by two GCNs with weight sharing for both components of the protein complexes. These embeddings were then passed through a mutual attention network to extract the relevant features for both proteins and concatenated into a single embedding vector. Residues with large learned attention weights are considered more important and more likely to contribute to the interaction. The vector was further passed into a feed-forward network (FFN) and a final Softmax layer to obtain the probability of a PPI.
Struct2Graph outperformed feature-based ML models and other SOTA sequence-based DL models, achieving an accuracy of 98.89% on a dataset balanced between positive and negative samples, and an accuracy of 99.42% on an unbalanced dataset (positive:negative = 1:10). Residue-level interpretation was conducted to identify the contribution of individual residues to PepPIs. For example, the Staphylococcus aureus Phenol Soluble Modulins (PSMs) peptide PSMα1 88 competes with high mobility group box-1 protein (HMGB1) to bind toll-like receptor-4 (TLR4), 89 thus inhibiting HMGB1-mediated phosphorylation of NF-κB. 90 For the PSMα1-TLR4 complex, Struct2Graph demonstrated an impressive accuracy of 92%, and the predicted binding residues aligned with the previously identified TLR4 active binding sites. Notably, peptide residues 2Gly and 10Val were accurately predicted as peptide binding residues. Furthermore, Struct2Graph's predictions corroborated the previously studied competitive binding mechanism, indicating that both the PSMα1 peptide and HMGB1 bind to the same area of TLR4.

hydrogen bond (Figure 6b), an SH or NH2 side-chain hydrogen donor surrounded by oxygen atoms (Figure 6c), a carbon in the vicinity of a methyl group and an aromatic ring (Figure 6d), and so on. The detected pattern with solvent-exposed residues that frequently appear in protein-protein interfaces (Figure 6e), such as Arginine (R), was positively correlated with the output probability of PPBS. In contrast, the pattern with buried hydrophobic amino acids (Figure 6f), such as Phenylalanine (F), was negatively correlated with the output probability of PPBS. Interestingly, the pattern with an exposed hydrophobic amino acid surrounded by charged amino acids, which is the hotspot O-ring 91 architecture in protein interfaces, was positively correlated with the output probability (Figure 6g).

Attention based models. Recurrent neural networks (RNN) and long short-term memory (LSTM) are the most common models for language modeling and machine translation. 92 However, both RNN and LSTM struggle with long-range dependencies; in other words, they become ineffective when there is a significant gap between relevant information and the point where it is needed. The attention mechanism was introduced to address this

Existing ML and DL models for predicting peptide-protein binding sites mainly focus on identifying binding residues on the protein surface. Sequence-based methods typically take protein sequences as inputs, assuming that a protein maintains fixed binding residues across different peptide binders. However, this assumption does not hold for most cellular processes, as various peptides may interact with distinct protein residues to carry out diverse functions. Structure-based methods require a target protein structure and a peptide sequence, which limits their applicability to proteins with available structural data. A novel DL framework for peptide-protein binding prediction, called CAMP, 32 was proposed to address these limitations. CAMP takes into account sequence information from both peptides and target proteins, and also detects crucial binding residues of peptides for peptide drug discovery.
7][98][99] For each PDB complex, protein-ligand interaction predictor (PLIP) is employed to identify non-covalent interactions between the peptide and the protein, considering these interactions as positive samples for training. Additionally, PepBDB 100 aids in determining the binding residues of peptides involved in the specific protein-peptide complexes. Various features are extracted based on their primary sequences to construct comprehensive sequence profiles for peptides and proteins. These features include secondary structure, physicochemical properties, intrinsic disorder tendencies, and evolutionary information. 17,101-104 CAMP utilizes two multi-channel feature extractors to process peptide and protein features separately (Figure 7). Each extractor contains a numerical channel for numerical features (PSSM and the intrinsic disorder tendency of each residue), along with multiple categorical channels for diverse categorical features (raw amino acid, secondary structure, polarity and hydropathy properties). Two CNN modules extract hidden contextual features from peptides and proteins. Self-attention layers are also employed to capture long-range dependencies between residues and assess the contribution of each residue to the final interaction. CAMP applies fully connected layers on all integrated features to predict the interaction between proteins and peptides. In addition to binary interaction prediction, CAMP can identify which residues of a peptide interact with the target protein by adding a sigmoid activation function to the output of the peptide CNN module.
Compared with three baseline models (DeepDTA, 105 PIPR, 106 NRLMF 107 ), CAMP demonstrates consistently better performance, with increases of up to 10% and 15% in Area Under the Curve (AUC) and Area Under the Precision-Recall Curve (AUPR), respectively. To evaluate its ability to identify binding residues of peptides, the predicted label of each peptide residue is compared with the real label for four existing peptide binders. The results show that CAMP correctly predicts binding residues and thus provides reliable evidence for peptide drug design. In contrast, PepNN-Seq only takes the protein and peptide sequences as inputs (Figure 8b).

Instead of only applying a self-attention layer, Abdin et al. developed a Transformer-based
In the PepNN algorithm, the encoding of the peptide sequence is independent from the protein encoding module, under the assumption that the peptide sequence carries all the necessary information regarding peptide-protein binding. However, in many scenarios, the peptide sequence is not sufficient to determine the bound conformation, as the same peptide can adopt different conformations when bound to different proteins. 108 Motivated by this, PepNN incorporates a multi-head reciprocal attention layer that simultaneously updates the embeddings of both the peptide and protein (Figure 8a). This module attempts to learn the interactions between protein and peptide residues involved in binding.
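The idea of updating peptide and protein embeddings from one shared set of attention scores can be sketched as below. This is a simplified illustration rather than the PepNN layer itself: it uses a single head, omits the learned query, key and value projections, and all function names and toy embeddings are assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def reciprocal_attention(peptide, protein):
    """Update both embeddings from one shared peptide-by-protein score matrix."""
    d = len(peptide[0])
    scores = [[v / math.sqrt(d) for v in row]
              for row in matmul(peptide, transpose(protein))]
    # Peptide residues attend over protein residues (rows of the score matrix).
    peptide_update = matmul([softmax(r) for r in scores], protein)
    # Protein residues attend over peptide residues (columns of the same matrix).
    protein_update = matmul([softmax(r) for r in transpose(scores)], peptide)
    return peptide_update, protein_update


peptide = [[1.0, 0.0], [0.0, 1.0]]               # 2 peptide residues, dimension 2
protein = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 protein residues, dimension 2
peptide_update, protein_update = reciprocal_attention(peptide, protein)
```

Because both updates come from the same score matrix, a peptide residue that attends strongly to a protein residue is, in turn, a strong contributor to that protein residue's update.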
Another challenge in predicting protein-peptide binding sites is the limited availability of protein-peptide complex training data. Protein-protein complex information was added to the training set to overcome this limitation. Notably, entire protein-protein complexes were not included, because the interaction between two proteins can be mediated by a linear segment in one protein that contributes the majority of the interface energy. Pre-training of the model was conducted using a substantial dataset of large protein fragment-protein complexes (717,932). 109 Fine-tuning of the model then took place with a smaller set of peptide-protein complexes (2,828), resulting in a considerable enhancement in predictive performance, particularly for the PepNN-Struct model (Figure 8c). 8][19] PepNN-Struct surpassed most peptide binding site prediction approaches, achieving a higher AUC score. While PepNN generally exhibits a lower MCC than the SOTA method AlphaFold-Multimer, its independence from multiple sequence alignments may render PepNN more suitable for modeling synthetic PepPIs.
While numerous computational methods have been developed for predicting peptide-protein binding sites, many of them need complex data preprocessing to extract features, often resulting in reduced computational efficiency and predictive performance. Wang et al. developed an end-to-end predictive model named PepBCL that is independent of feature engineering. 4 This approach leverages pre-trained protein language models to distill knowledge from protein sequences that is relevant to protein structures and functions.
Another challenge encountered in identifying protein-peptide binding sites is the issue of imbalanced data. Current work typically constructs a balanced dataset by using under-sampling techniques. However, these techniques remove samples from the majority class to match the size of the minority class. In the PepBCL algorithm, a contrastive learning-based module is introduced to tackle this problem. Unlike conventional under-sampling methods, the contrastive learning module adaptively learns more discriminative representations of the

The PepBCL architecture is composed of four essential modules: a sequence embedding module, a BERT-based encoder module, 94 an output module and a contrastive learning module. 110,111 In the sequence embedding module, each amino acid of the query sequence is encoded into a pre-trained embedding vector, while the protein sequence is encoded into an embedding matrix. In the BERT-based encoder module, the output of the sequence embedding module undergoes further encoding through BERT to generate a high-dimensional representation vector. 112 The representation vector is then passed through a fully connected layer. In the contrastive learning module, the contrastive loss between any two training samples is optimized to generate more discriminative representations of the binding residues. In the output module, the probability of each residue being in a binding site is calculated (Figure 9a).
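A pairwise contrastive loss of the kind described above can be sketched as follows. The exact loss used in PepBCL is not given in the text, so the classical margin-based form is assumed here: residues with the same label are pulled together, and residues with different labels are pushed at least `margin` apart. All names and values are illustrative.

```python
import math


def contrastive_loss(z1, z2, same_class, margin=1.0):
    """Margin-based pairwise contrastive loss (an assumed, classical form)."""
    d = math.dist(z1, z2)
    if same_class:
        return d ** 2                    # pull same-label residues together
    return max(0.0, margin - d) ** 2     # push different-label residues apart


def batch_contrastive_loss(embeddings, labels, margin=1.0):
    """Average the pairwise loss over every pair of residues in a batch."""
    total, pairs = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            total += contrastive_loss(embeddings[i], embeddings[j],
                                      labels[i] == labels[j], margin)
            pairs += 1
    return total / pairs if pairs else 0.0


# Toy residue embeddings: two binding (1) and two non-binding (0) residues.
embeddings = [[0.0, 0.0], [0.1, 0.0], [2.0, 0.0], [2.0, 0.1]]
labels = [1, 1, 0, 0]
loss = batch_contrastive_loss(embeddings, labels)
```

Because every pair contributes to the loss, the minority class shapes the representation without any majority-class sample being discarded, which is the contrast with under-sampling drawn in the text.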
When compared with the existing sequence-based methods (SPRINT-Seq, 15 PepBind, 17 Visual, 24 and PepNN-Seq 34 ), PepBCL achieves a significant improvement in precision by 7.1%, AUC by 2.2%, and MCC by 1.3% over the best sequence-based predictor, PepBind. 17 Furthermore, PepBCL also outperforms all structure-based methods (i.e., Pepsite, 50 Peptimap, 49 SPRINT-Str, 18 and PepNN-Struct 34 ) in terms of MCC. The superior performance of PepBCL indicates that DL approaches can automatically learn features from protein sequences to distinguish peptide binding residues from non-binding residues, eliminating the reliance on additional computational tools for feature extraction. When the various methods are assessed with these evaluation metrics, recall and MCC tend to be notably low because of the extreme class imbalance in the dataset. This suggests that many true protein-peptide binding residues may be overlooked. However, PepBCL demonstrates improved recall and MCC values, highlighting the effectiveness of the contrastive module in identifying more true peptide binding residues. This enhancement can be attributed to the ability of contrastive learning to extract more discriminative representations, particularly in imbalanced datasets.

AlphaFold/RoseTTAFold/OmegaFold/ESMFold. Multiple Sequence Alignment (MSA)-based transformer models such as AlphaFold2 (AF2, including the monomer model 35 and the multimer model 43 ) and RoseTTAFold, 113 and protein Language Model (pLM)-based models such as OmegaFold 39 and ESMFold, 114 have demonstrated remarkable success in the in silico prediction of the folding of monomeric proteins and peptides. 115 However, PepPIs are relatively flexible protein complexes, making it challenging to achieve highly accurate predictions. Therefore, benchmarking these SOTA DL techniques on PepPI predictions could provide structural insights into peptide-protein complexes, for example, binding affinities, conformational dynamics, and interaction interfaces, thus contributing to the advancement of molecular biology and drug discovery.

al. 36 The PepPIs could be represented as the folding of a monomeric protein by connecting the peptide to the C-terminus of the receptor with a poly-glycine linker (Figure 10a), which gives a general recipe for performing peptide-protein docking with the AF2 monomer model. This method can not only identify the peptide binding regions but also accommodate binding-induced conformational changes of the receptor. AF2 surpassed RoseTTAFold, since the latter tended to fold the poly-glycine linker into a globular structure or various interactive loops. For a small dataset of 26 PepPI complexes, AF2 achieved a relatively high accuracy (75%) for complexes whose binding motifs have been experimentally characterized.
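The poly-glycine linker construction can be sketched as a simple string operation that prepares a single-chain input for a monomer structure predictor. The linker length of 30 glycines, the function name and the example sequences are assumptions made for illustration; no structure prediction call is made.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def build_linked_sequence(receptor, peptide, linker_length=30):
    """Join receptor and peptide through a poly-glycine linker.

    Returns the single-chain sequence and the half-open (start, end)
    index range of the peptide within it. linker_length is an assumption.
    """
    receptor, peptide = receptor.upper(), peptide.upper()
    for name, seq in (("receptor", receptor), ("peptide", peptide)):
        bad = set(seq) - VALID_RESIDUES
        if bad:
            raise ValueError(f"{name} contains non-standard residues: {sorted(bad)}")
    linker = "G" * linker_length
    start = len(receptor) + len(linker)
    return receptor + linker + peptide, (start, start + len(peptide))


# Hypothetical toy sequences, far shorter than a real receptor.
sequence, (pep_start, pep_end) = build_linked_sequence("MKT", "PPLP", linker_length=5)
```

Keeping track of the peptide index range makes it straightforward to extract the peptide coordinates and per-residue confidence from the predicted single-chain model afterwards.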
AF2 also outperformed another peptide docking method PIPER-FlexPepDock (PFPD) 116 in terms of both accuracy and speed.Furthermore, accurate predictions were achieved with AF2 pLDDT values above 0.7, further verifying that AF2 monomer can reliably predict the PepPIs.However, the predicted accuracy became lower (37%) when tested on a larger dataset (96 complexes), indicating that further improvements are needed for more accurate PepPI predictions by AF2 monomer.
The recent release of AF2 multimer has yielded a major improvement in PepPI prediction (Figure 10b). Using a set of 99 protein-peptide complexes, Shanker et al. 37 compared the performance of AF2 monomer, AF2 multimer, and OmegaFold on PepPI prediction with that of their peptide docking software AutoDock CrankPep (ADCP). 80 The new AF2 multimer model, which was trained to predict the interfaces of multimeric protein complexes, achieved 53% accuracy and outperformed OmegaFold (20% accuracy) and ADCP (23% accuracy) (Figure 10c). However, the AF2 multimer model is limited to linear peptides, which reduces its applicability to cyclized peptides or peptides with non-standard amino acids. Effective selection from the top-ranked poses yielded by both AF2 multimer and the ADCP docking tool was found to further enhance the accuracy to 60%. Therefore, DL protein structure prediction models, especially AF2 multimer, have achieved high accuracy in PepPI predictions, though limitations exist. Combining these SOTA DL models with traditional peptide docking tools could be a future direction for further improving the accuracy of PepPI predictions.
Leveraging the highly accurate predictions of protein structures by AF2, Amir Motmaen et al. 45 developed a more generalized model for the prediction of PepPIs. The model was built by placing a classifier on top of the AF2 network and fine-tuning the combined network (Figure 10d). AF2 was able to achieve optimal performance and generate the most accurate predicted complex structure models for a large dataset of peptide-Major Histocompatibility Complex (MHC) complexes. This was accomplished by aligning the peptide sequence with the peptide-protein crystal structures as templates. However, AF2's occasional docking of non-binding peptides in the peptide binding domain of MHC highlighted the need for a clear classification of binder and non-binder peptides in the training of the model.
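As a rough illustration of the binder classification layer summarized in Figure 10d, the sketch below maps a pooled Predicted Aligned Error (PAE) value to a binder probability with a logistic function and adds a classification term to a structure loss. It is a toy in plain Python, not the fine-tuned network: the pooling over protein-peptide residue pairs, the slope and midpoint values (which would be learned during fine-tuning), the equal loss weighting, and all function names are assumptions.

```python
import math
from typing import Sequence


def mean_interface_pae(pae: Sequence[Sequence[float]],
                       protein_idx: Sequence[int],
                       peptide_idx: Sequence[int]) -> float:
    """Average PAE over protein-peptide residue pairs in both directions.

    Which PAE entries are pooled is an assumption of this sketch.
    """
    values = []
    for i in protein_idx:
        for j in peptide_idx:
            values.append(pae[i][j])
            values.append(pae[j][i])
    return sum(values) / len(values)


def binder_probability(mean_pae: float, slope: float, midpoint: float) -> float:
    """Logistic map: a lower PAE (more confident docking) gives a higher score."""
    return 1.0 / (1.0 + math.exp(slope * (mean_pae - midpoint)))


def binary_cross_entropy(p: float, label: float) -> float:
    """Classification loss for one example, clipped for numerical safety."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def combined_loss(structure_loss_complex: float, structure_loss_protein: float,
                  p_binder: float, is_binder: bool,
                  class_weight: float = 1.0) -> float:
    """Structure loss over the whole complex for binders and over the protein
    only for non-binders, plus the classification loss (weighting assumed)."""
    structure = structure_loss_complex if is_binder else structure_loss_protein
    label = 1.0 if is_binder else 0.0
    return structure + class_weight * binary_cross_entropy(p_binder, label)


# Toy 4-residue example: residues 0-2 are the protein, residue 3 the peptide.
toy_pae = [
    [0.0, 1.0, 1.0, 4.0],
    [1.0, 0.0, 1.0, 6.0],
    [1.0, 1.0, 0.0, 5.0],
    [3.0, 5.0, 4.0, 0.0],
]
pooled = mean_interface_pae(toy_pae, protein_idx=[0, 1, 2], peptide_idx=[3])
score = binder_probability(pooled, slope=1.0, midpoint=4.5)  # placeholder parameters
print(pooled, score)
```

In a real fine-tuning setup the structure losses would come from the structure prediction network itself; here they are plain numbers so that the bookkeeping of the combined objective is visible.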
To address this issue, a logistic regression layer that normalizes the AF2 Predicted Aligned Error (PAE) score into a binder/non-binder score was placed on top of AF2. This resulted in three types of losses being combined and applied to further fine-tune the combined model: structure loss on both peptide and protein for binding peptide-protein complexes, structure loss on protein only for non-binding peptide-protein complexes, and classification loss on the binding/non-binding score. The evaluation of the combined model showed a ROC AUC of 0.97 for Class I and 0.93 for Class II peptide-MHC interactions. Surprisingly, the fine-tuned model outperformed the previously mentioned HSM model and could also be generalized to PDZ domains (C-terminal peptide recognition domain) and SH3 domains (proline-rich peptide binding domain), despite being trained and fine-tuned only on the peptide-MHC dataset. Therefore, taking advantage of the accurate predictions of protein structures through AF2, and fine-tuning the model with existing peptide-protein binding data, offers a significant boost to PepPIs predictions.

In conclusion, ML/DL-guided methods have shown significant potential for the accurate predictions of peptide-protein complex structures and binding sites. These SOTA models will undoubtedly further accelerate the process of peptide drug discovery and design.

Figure 1: Timeline of recent Machine Learning and Deep Learning methods for PepPIs prediction.

Figure 2: The input features and architecture of SPRINT-Seq. G-SEQ: sequence feature. G-PF: Sequence profile from Position Specific Scoring Matrix (PSSM). G-SS: Secondary Structure-based features. G-ASA: Accessible Surface Area-based features. G-PP7: Physicochemical-based feature group.
), including 8 separate HSM/ID models (ID means independent domain, one for each protein family) and a single unified HSM/D model covering all families (D means domains). The HSM model remarkably outperformed other ML models such as NetPhorest 64 and PepInt. 65 By computing the energies from the pseudo-Hamiltonian, the HSM model can evaluate and rank the possibilities of different PepPI patterns, facilitating the verification of existing PepPI ensembles and the discovery of new possible PepPI ensembles.

Figure 3: Details of the Hierarchical statistical mechanical (HSM) model. (a) Mechanism of HSM models of PBD-peptide interactions: HSM/ID for independent domains (left to right) and HSM/D for domains (right to left). Model extent is denoted by the black bars. (b) Structure of the peptide (black) bound with the HCK SH3 domain (PDB ID: 2OI3). Different colors for the HCK SH3 domain represent different domain residue clusters. The W-Y-Y aromatic triplet residues (Y87, W114, Y132) and specificity-defining loops (RT, n-Src) in the HCK SH3 domain are labeled. (c) W114 tryptophan switch binding motif and new possible W-Y-Y aromatic triplet (Y89, W114, Y127) found by the HSM model. Energy potentials for the interaction of W114 and Y89 with a single peptide position show strong similarity. (d) Energy surfaces between the HCK SH3 domain and peptide (sequence: HSKYPLPPLPSL). Each domain residue is colored by the mean predicted interaction energy with peptide residues lying within 2.5 Å.

Figure 4: The workflow of the Visual model. (a) Transforming the protein sequence into a 7×38 input image (per residue). In order from left to right of the image: 3 pixels represent Half Sphere Exposure (HSE), 56 3 pixels represent the predicted probabilities of different secondary structures, 1 pixel represents the Accessible Surface Area (ASA) value, 4 pixels represent the local backbone angles, 20 pixels represent the Position Specific Scoring Matrix (PSSM), and 7 pixels represent the physicochemical properties of the amino acids. (b) Training and optimizing the hyperparameters of the CNN. (c) Testing the optimized CNN on unseen test data to predict the label of each residue (binding/non-binding).

Figure 5: Struct2Graph model architecture. The Struct2Graph model loads graph embeddings of both components into two weight-sharing graph convolutional networks (GCNs) separately. The GCN outputs are integrated into a mutual attention network to predict the probability of PPI and the interaction sites.
2D t-distributed stochastic neighbor embedding (t-SNE) projections further verified that the model has already learned various amino acid-level structural features. 2D t-SNE projections on secondary structures (Figure 6h) clearly illustrated that the model has learned the secondary structural information of the training complexes. With the multi-level knowledge of protein structures, ScanNet captures the underlying chemical principles of protein-protein binding. This SOTA interpretable DL model aids in a deeper understanding of PepPIs and PPIs.

Figure 6: (a) Overview of the ScanNet model architecture. A point cloud including neighboring atom information was first extracted for each atom from the protein structure. The point cloud was then passed through linear filters to detect specific atom interaction patterns, yielding an atomic-scale representation. This representation was pooled to the amino acid scale, concatenated with the extracted neighboring amino acid attributes from the protein structure, and then passed through a similar procedure to identify amino acid neighborhoods and representations. (b-d) Each panel shows one learned atom-level spatio-chemical pattern on the left and the corresponding top-activating neighborhood on the right. (b) N-H-O hydrogen bond, (c) two oxygen atoms and three NH groups in a specific arrangement, (d) a carbon in the vicinity of a methyl group and an aromatic ring. (e-g) Each panel shows one learned amino acid-level spatio-chemical pattern on the left and one corresponding top-activating neighborhood on the right. (e) Solvent-exposed residues, positively correlated with the output probability (r=0.31), (f) buried hydrophobic amino acids, negatively correlated with the output probability (r=-0.32), (g) the hotspot O-ring architecture, an exposed hydrophobic amino acid surrounded by exposed, charged amino acids, positively correlated with the output probability (r=0.29). (h) Two-dimensional projection on secondary structure of the learned amino acid scale representation using t-SNE.

Figure 7: The network architecture of CAMP. For each protein-peptide pair, the numerical and categorical features of the peptide and protein sequences are extracted and fed into CNN modules. The amino acid representations of the peptide and protein are also fed into self-attention modules to learn the importance of individual residues to the final prediction. The outputs of the CNN and self-attention modules are then taken together as input to three fully connected layers to predict a binding score for each peptide-protein pair. The output of the CNN modules is also used for predicting a binding score for each residue of the peptide sequence.

Figure 8: The model architecture and training procedure of PepNN. (a) The input of PepNN-Struct and the model architecture. Attention layers are indicated in orange, normalization layers in blue, and simple transformation layers in green. (b) The input of PepNN-Seq. (c) Transfer learning pipeline used for training PepNN.

Figure 9b visually demonstrates the learned feature space with and without the contrastive learning module, showcasing a clearer distribution of binding and non-binding residues in the feature space.

Figure 9: (a) The architecture of PepBCL consists of four modules. Sequence embedding module: converts the protein sequence to a sequence embedding for each residue; BERT-based encoder module: extracts high-quality representations of each residue in the protein; output module: predicts the label (binding/non-binding) of residues using fully connected layers; and contrastive learning module: obtains more distinguishable representations by minimizing the contrastive loss. (b) t-SNE visualization of the feature space distribution of PepBCL with/without the contrastive learning module on the testing dataset.

Figure 10: (a) A successful example (PDB ID: 1SSH) of peptide-protein docking with a poly-glycine linker via AlphaFold2. This method can dock the peptide at the correct position (the native peptide is shown in black, docked peptides are shown in other colors) and identify the linker as an unstructured region (modeled as a circle). (b) Peptide-protein complex structure that is successfully predicted with AlphaFold2-Multimer. The ground truth structures are shown in green and predicted structures are colored by chain. (c) The AlphaFold2-Multimer model outperforms other DL approaches and achieves a remarkable docking success rate of 53% for peptide-protein docking. A designed docking approach combining ADCP and AlphaFold2-Multimer achieves an improved success rate of 60%. (d) Mechanism of structure prediction networks for peptide binder classification by fine-tuning AlphaFold2. The input of the model includes the peptide binder and non-binder sequences, protein sequences, and peptide-protein co-crystal structures as templates. After positionally aligning the peptide sequence to the template, the complex structure is predicted with AlphaFold2. A binder classification layer converts the AlphaFold2 output PAE values into a binder/non-binder score. The combined loss function, including the structure loss over the entire complex for peptide binders and over the protein only for non-binders, and the classification loss from the binder classification layer, is used for model training.

Conclusions and Future Research Directions
Peptides, which are short proteins consisting of around 2 to 50 amino acids, are known for their flexibility. This characteristic makes it challenging to achieve highly accurate predictions of PepPIs. A variety of SOTA ML and DL models summarized in this review have been designed and applied to predict PepPIs, which are key to de novo peptide drug design. Apart from their well-documented high efficiency and accuracy, ML/DL methods offer several other advantages in the predictions of PepPIs. Compared to Docking or MD Simulation methods, ML or DL methods offer diverse options for model inputs. DL methods, such as transformers and language models, have been shown to achieve great success in predicting PepPIs based solely on sequence information. Instead of original sequence or structure information, ML methods can also incorporate multi-level information such as evolutionary information, secondary structures, solvent accessible surface 
area, and so forth, which could significantly enhance the accuracy of the prediction. Furthermore, more interpretability can be provided by ML/DL methods. The attention mechanism assists in demonstrating the internal dependencies between residues and the contribution of each residue to PepPIs. Graph models capturing multi-scale structure information of peptides and proteins are able to provide insights into the underlying chemical principles and patterns of peptide-protein binding. Moreover, ML/DL techniques exhibit a degree of generalizability. Transfer learning could help models trained on certain peptide-protein binding datasets to generalize to other peptide-protein complexes.

Despite their numerous advantages, ML and DL methods also have certain limitations in the prediction of PepPIs, which highlight potential areas for future research. One significant challenge is the issue of imbalanced datasets in the training and testing of PepPIs prediction models. Given that peptide binding is typically a rare occurrence, the imbalanced number of positive and negative samples often limits the performance of ML/DL models because the minority binding class is poorly learned. Consequently, ML/DL methods for PepPI predictions were normally trained on datasets with a positive-to-negative ratio of 1:1. Both oversampling methods, which duplicate or create new samples, and undersampling methods, which delete or merge samples in the majority class, can enhance the model performance on imbalanced classification.

Additionally, ML/DL methods often fail in the prediction of PepPIs between intrinsically disordered peptides (IDPs) and proteins. IDPs are abundant in nature, with flexible and disordered structures, but adopt stable and well-defined structures upon binding. In these cases, ML/DL methods, particularly structure-based models, tend to fail in predicting binding sites and peptide binding conformations, offering little insight into the binding mechanism. With the enhancement of computing power, high-throughput MD simulations can achieve more accurate predictions of binding sites and peptide/protein conformations as well as a deeper understanding of the mechanism of folding and binding, induced fit (binding then folding), or conformational selection (folding then binding). The integration of MD or quantum chemical insights and ML/DL methods could constitute a promising future research direction of PepPIs predictions. Furthermore, some advanced techniques like transfer learning or one-shot learning models can also be applied to address the low data issue in the PepPIs prediction. 117,118

In addition to enhancing the predictive accuracy of established ML and DL models, future research directions should prioritize the enhancement of the models' ability to generate novel peptide sequences for specific target proteins of interest, thereby contributing to de novo peptide drug design. An essential way is to fine-tune a pre-trained pLM. Introducing noise and perturbations within the peptide latent space of the pLM, or masking peptide sequences to help the model learn the probability distribution of peptide binders, could be explored to generate entirely new peptide sequences. Additionally, diffusion models offer another avenue for achieving the generative tasks. These models possess a deeper understanding of the intricate molecular interactions at the atomic level, thus enabling the generation of new peptide sequences based on peptide-protein complex structures. The resultant novel peptide sequences can be subsequently validated through MD simulations and in vitro and in vivo experimental tests. Therefore, developing new generative models or leveraging pre-trained ML/DL models to facilitate peptide generation represents a noteworthy and promising future direction for advancing peptide drug design.
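The masking idea mentioned above can be made concrete with a small sketch of the data side of masked-sequence training and generation. It shows only how peptide positions might be masked and later refilled from per-position probabilities. The 15% mask rate, the `<mask>` token, and the `uniform_predictor` stand-in (which takes the place of a fine-tuned pLM) are all illustrative assumptions, not details of any published model.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK_TOKEN = "<mask>"  # hypothetical mask symbol


def mask_peptide(peptide: str, mask_rate: float = 0.15,
                 rng: Optional[random.Random] = None
                 ) -> Tuple[List[str], Dict[int, str]]:
    """Replace a random subset of residues with MASK_TOKEN.

    Returns the masked token list and a {position: original residue} map
    that a model would be trained to recover.
    """
    if not peptide:
        raise ValueError("peptide must be non-empty")
    if not 0.0 < mask_rate <= 1.0:
        raise ValueError("mask_rate must be in (0, 1]")
    if rng is None:
        rng = random.Random()
    tokens = list(peptide)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {pos: tokens[pos] for pos in positions}
    for pos in positions:
        tokens[pos] = MASK_TOKEN
    return tokens, targets


def fill_masks(tokens: List[str],
               predict: Callable[[List[str], int], Dict[str, float]],
               rng: Optional[random.Random] = None) -> str:
    """Sample a residue for every masked position from predict(tokens, pos)."""
    if rng is None:
        rng = random.Random()
    filled = list(tokens)
    for pos, token in enumerate(tokens):
        if token == MASK_TOKEN:
            probs = predict(tokens, pos)
            residues = list(probs)
            weights = [probs[r] for r in residues]
            filled[pos] = rng.choices(residues, weights=weights, k=1)[0]
    return "".join(filled)


def uniform_predictor(tokens: List[str], position: int) -> Dict[str, float]:
    """Stand-in for a fine-tuned pLM: every amino acid is equally likely."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


# Toy example using the peptide sequence quoted in the Figure 3 caption.
peptide = "HSKYPLPPLPSL"
masked, targets = mask_peptide(peptide, rng=random.Random(0))
candidate = fill_masks(masked, uniform_predictor, rng=random.Random(1))
print(masked)
print(candidate)
```

Swapping `uniform_predictor` for a model conditioned on the target protein would turn the same loop into a simple iterative generator, whose outputs would still need the downstream validation described above.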